Kevin Williams Business Analytics Final Project¶
- Real Estate Data
- Concordia University - Wisconsin
- Professor John Fields
Project Information¶
- Data was Obtained from Kaggle.com and represents actual rental property data scraped from Craigslist.
- This project will use the OSEMN framework that stands for:
- Obtain
- Scrub
- Explore
- Model
- Interpret
- Our goal of this project is to perform experimental analysis on rental listings on Craigslist in the United States as a whole. Ultimately, we want to know what factors/variables impact the rent per month which leads into our research question.
- For the feasibility of this project, I will remove a big chunk of the data set below - this will allow VS code to run smoother.
Loading Neccessary Packages¶
pip install numpy scipy pandas matplotlib.pyplot plotly==5.24.1 statsmodels mlxtend
pip install --upgrade plotly jupyterlab
import numpy as np
import scipy as sci
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import statsmodels as stats
More packages will be added later into the project
Part One: Obtain¶
DF1 = pd.read_csv(r'c:\Users\Kevin\Desktop\Bus_Data_Anys_prog_class\Real_Estate Data.csv')
Part Two: Scrub¶
DF1.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 384977 entries, 0 to 384976 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 id 384977 non-null int64 1 region 384977 non-null object 2 region_url 384977 non-null object 3 price 384977 non-null int64 4 type 384977 non-null object 5 sqfeet 384977 non-null int64 6 beds 384977 non-null int64 7 baths 384977 non-null float64 8 cats_allowed 384977 non-null int64 9 dogs_allowed 384977 non-null int64 10 smoking_allowed 384977 non-null int64 11 wheelchair_access 384977 non-null int64 12 electric_vehicle_charge 384977 non-null int64 13 comes_furnished 384977 non-null int64 14 laundry_options 305951 non-null object 15 parking_options 244290 non-null object 16 image_url 384977 non-null object 17 description 384975 non-null object 18 lat 383059 non-null float64 19 long 383059 non-null float64 20 state 384977 non-null object dtypes: float64(3), int64(10), object(8) memory usage: 61.7+ MB
- Dataset contains 21 columns.
- Dataset contains 384,977 rows of data.
Checking for Missing Values¶
DF1.isna().sum()
id 0 region 0 region_url 0 price 0 type 0 sqfeet 0 beds 0 baths 0 cats_allowed 0 dogs_allowed 0 smoking_allowed 0 wheelchair_access 0 electric_vehicle_charge 0 comes_furnished 0 laundry_options 79026 parking_options 140687 image_url 0 description 2 lat 1918 long 1918 state 0 dtype: int64
- Latitude and Longitude are missing 1,918 records.
- Because this is such a low number, compared to the amount of data, it would be simpler to omit the rows from the dataset.
- Parking options have 140,687 missing records; because this is much higher than anticipated, this column will be removed from the dataset. Imputing method on this column will create inaccuracy.
- Laundry column will also be removed from dataset.
Selecting Relevant Variables¶
DF2 = DF1.iloc[:50000, 3:21]
DF2.drop('laundry_options', axis= 'columns', inplace= True)
DF2.drop('parking_options', axis= 'columns', inplace= True)
DF2.drop('description', axis= 'columns', inplace= True)
DF2.drop('image_url', axis= 'columns', inplace= True)
Removing missing data from lat and long columns¶
DF2 = DF2.dropna()
DF2.isna().sum()
price 0 type 0 sqfeet 0 beds 0 baths 0 cats_allowed 0 dogs_allowed 0 smoking_allowed 0 wheelchair_access 0 electric_vehicle_charge 0 comes_furnished 0 lat 0 long 0 state 0 dtype: int64
Renaming Price Variable to "monthly_rent" and sqfeet variable to "squarefeet"¶
DF2.rename(columns= {'price' : 'monthly_rent', 'sqfeet': 'squarefeet'}, inplace= True)
Filtering data¶
# Filtering dataset
DF3 = DF2[(DF2['monthly_rent'] <= 12000) & (DF2['monthly_rent'] >= 500) &
(DF2['squarefeet'] >= 300) & (DF2['beds'] <= 5) & (DF2['beds'] >= 1) & (DF2['squarefeet']<= 5000) & (DF2['baths'] >= 1)]
Part Three: Explore¶
DF3.describe()
| monthly_rent | squarefeet | beds | baths | cats_allowed | dogs_allowed | smoking_allowed | wheelchair_access | electric_vehicle_charge | comes_furnished | lat | long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 47498.000000 | 47498.000000 | 47498.000000 | 47498.000000 | 47498.000000 | 47498.000000 | 47498.000000 | 47498.000000 | 47498.000000 | 47498.000000 | 47498.00000 | 47498.000000 |
| mean | 2330.226178 | 1267.278012 | 2.244705 | 1.798813 | 0.661502 | 0.648659 | 0.583267 | 0.131985 | 0.047286 | 0.076214 | 37.36345 | -100.144068 |
| std | 779.363888 | 553.623447 | 0.925673 | 0.647688 | 0.473204 | 0.477394 | 0.493023 | 0.338477 | 0.212253 | 0.265343 | 5.77784 | 21.861932 |
| min | 1700.000000 | 300.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -43.53330 | -163.894000 |
| 25% | 1850.000000 | 916.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 33.93770 | -119.062000 |
| 50% | 2100.000000 | 1134.000000 | 2.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 38.11310 | -104.595500 |
| 75% | 2500.000000 | 1453.000000 | 3.000000 | 2.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 40.82220 | -77.401525 |
| max | 12000.000000 | 5000.000000 | 5.000000 | 7.500000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 64.99370 | 172.633000 |
Takeaways from descriptive statistics¶
- The average monthly rent is $2,330.00. The median is $2,100; because the mean is higher than the median, this would mean the data is skewed to the right. This is entails that more data lies on the left side of the distribution.
- The standard deviation of monthly rent is $779.00.
- To determine if this is high - we will compare it to the mean using coefficient of variation below. Coefficient of variation is a way to measure how spread out values are in a dataset relative to the mean. A value greater than 1 is often considered high.
- The average square-footage is 1,267. The median is 1,134 square-feet. Again, because the mean is higher than the median, this would tell us the data is skewed to the right.
- We will also use the coefficient of variation to determine if the standard deviation of 553 square-feet is high.
Coefficient of variation¶
# CV for Monthly Rent
STDRENT = 779.00
MEANRENT = 2330.00
def CV_Rent():
CVRent = round((STDRENT/MEANRENT) * 100)
print('The coefficent of variation for monthly rent is:',CVRent,'%')
CV_Rent()
The coefficent of variation for monthly rent is: 33 %
#CV for Square Footage
STDFOOTAGE = 553.00
MEANFOOTAGE = 1267.00
def CV_Footage():
CVFootage = round((STDFOOTAGE/MEANFOOTAGE) * 100)
print('The coefficent of variation for monthly rent is:',CVFootage,'%')
CV_Footage()
The coefficent of variation for monthly rent is: 44 %
Data Visualization¶
import plotly.io as pio
pio.renderers.default = 'notebook'
pip install seaborn
import seaborn as sns
Histograms¶
px.histogram(DF3, x= 'monthly_rent',
title= 'Histogram of Monthly Rent',
color_discrete_sequence= ['green'])
sci.stats.skew(DF3['monthly_rent'])
np.float64(3.763049120952408)
px.histogram(DF3, x= 'squarefeet',
title= 'Histogram of Square Footage',
color_discrete_sequence= ['green'])
sci.stats.skew(DF3['squarefeet'])
np.float64(1.8898599855496985)
Map¶
def DataMap():
MapData = px.scatter_mapbox(DF3,
lon= DF3['long'],
lat= DF3['lat'],
zoom= 4,
color= DF3['type'],
width= 1200,
height= 900,
size= DF3['monthly_rent'],
title= 'Map of Housing Data - This will help us visually view the location of our housing data')
MapData.update_layout(mapbox_style = 'open-street-map')
MapData.show()
DataMap()
Bar Charts¶
sns.set_style('dark')
sns.barplot(data= DF3, x= 'monthly_rent', y= 'type', estimator= 'mean', errorbar= None, edgecolor = 'blue',
color= 'orange')
plt.title('Average Monthly Rent by Living Type')
Text(0.5, 1.0, 'Average Monthly Rent by Living Type')
TypeMean = DF3.groupby('type').monthly_rent.mean()
TypeMean = pd.DataFrame(TypeMean)
TypeMean.sort_values(by= ['monthly_rent'], ascending= True).head(10)
| monthly_rent | |
|---|---|
| type | |
| apartment | 2274.021998 |
| townhouse | 2331.984159 |
| duplex | 2340.254470 |
| loft | 2380.051095 |
| cottage/cabin | 2413.612903 |
| in-law | 2414.642857 |
| flat | 2448.105882 |
| condo | 2519.454401 |
| house | 2525.232280 |
| manufactured | 2694.555556 |
plt.figure(figsize= (10,10))
sns.barplot(data= DF3, x= 'monthly_rent', y= 'state', estimator= 'mean', errorbar= None, color= 'orange',
edgecolor = 'blue')
plt.title('Average Monthly Rent by State')
Text(0.5, 1.0, 'Average Monthly Rent by State')
StateMean = DF3.groupby('state').monthly_rent.mean()
StateMean = pd.DataFrame(StateMean)
StateMean.sort_values(by= ['monthly_rent'], ascending= True)
| monthly_rent | |
|---|---|
| state | |
| sd | 1921.454545 |
| md | 1968.771429 |
| va | 1985.210623 |
| ut | 1989.693333 |
| nd | 2005.985401 |
| nh | 2010.847887 |
| mo | 2012.500000 |
| ak | 2023.294393 |
| in | 2030.371429 |
| nc | 2034.089583 |
| me | 2086.276190 |
| de | 2097.258216 |
| ky | 2101.745098 |
| ga | 2107.144444 |
| id | 2108.207171 |
| az | 2125.946541 |
| ia | 2126.296552 |
| ri | 2133.069444 |
| mi | 2134.658444 |
| nv | 2144.086207 |
| mt | 2144.212329 |
| ne | 2146.837838 |
| tx | 2146.914316 |
| ar | 2151.018519 |
| ms | 2154.000000 |
| wi | 2157.928952 |
| ct | 2164.863636 |
| or | 2180.410256 |
| oh | 2183.477477 |
| ks | 2193.142857 |
| pa | 2197.734093 |
| co | 2208.196516 |
| tn | 2209.748677 |
| mn | 2217.412760 |
| la | 2218.109929 |
| vt | 2221.559055 |
| fl | 2246.187007 |
| nj | 2293.441423 |
| ok | 2296.074074 |
| wy | 2320.000000 |
| al | 2324.387755 |
| dc | 2324.754717 |
| wa | 2325.148827 |
| ma | 2332.834593 |
| sc | 2341.305600 |
| nm | 2390.942857 |
| il | 2433.630731 |
| wv | 2434.400000 |
| ny | 2493.355991 |
| ca | 2517.291421 |
| hi | 2573.647325 |
Scatter Plot¶
px.scatter(DF3, x= 'squarefeet', y= 'monthly_rent', color = 'type',
title= 'Relationship between Monthly Rent & Sq Footage')
The Scatter plot appears to be noisy. We can use the "corr()" function below to see the correlation numerically.
Correlation¶
Correlation = DF3['squarefeet'].corr(DF3['monthly_rent'])
print(f' The correlation between Square feet and monthly rent is {Correlation}')
The correlation between Square feet and monthly rent is 0.2521794550715152
Advance Statistical Analysis¶
We are also interested to see if there is a statistically significant difference between the living types and their respected average monthly price. To do this we will use Anova
ANOVA = DF3.groupby('type').mean(numeric_only=True)
ANOVA
| monthly_rent | squarefeet | beds | baths | cats_allowed | dogs_allowed | smoking_allowed | wheelchair_access | electric_vehicle_charge | comes_furnished | lat | long | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| type | ||||||||||||
| apartment | 2274.021998 | 1082.162109 | 1.961174 | 1.653637 | 0.741835 | 0.708180 | 0.625153 | 0.160093 | 0.056968 | 0.055595 | 37.459169 | -98.990983 |
| condo | 2519.454401 | 1235.983033 | 2.066278 | 1.839873 | 0.362672 | 0.380700 | 0.458643 | 0.098091 | 0.027041 | 0.216331 | 35.009757 | -102.241810 |
| cottage/cabin | 2413.612903 | 1195.298387 | 2.016129 | 1.532258 | 0.354839 | 0.387097 | 0.403226 | 0.064516 | 0.016129 | 0.387097 | 34.321426 | -113.338343 |
| duplex | 2340.254470 | 1392.162311 | 2.617607 | 1.773728 | 0.449794 | 0.478680 | 0.412655 | 0.114168 | 0.013755 | 0.141678 | 37.978051 | -109.431757 |
| flat | 2448.105882 | 1214.664706 | 2.147059 | 1.500000 | 0.711765 | 0.652941 | 0.588235 | 0.382353 | 0.370588 | 0.135294 | 36.620680 | -94.980686 |
| house | 2525.232280 | 2003.198595 | 3.390168 | 2.277669 | 0.406945 | 0.472103 | 0.485109 | 0.039407 | 0.012095 | 0.119651 | 37.419490 | -104.168295 |
| in-law | 2414.642857 | 725.392857 | 1.357143 | 1.035714 | 0.071429 | 0.035714 | 0.178571 | 0.000000 | 0.035714 | 0.321429 | 35.161196 | -121.439161 |
| loft | 2380.051095 | 1252.153285 | 1.788321 | 1.755474 | 0.722628 | 0.708029 | 0.489051 | 0.211679 | 0.087591 | 0.065693 | 38.415204 | -86.580630 |
| manufactured | 2694.555556 | 1383.777778 | 2.622222 | 1.822222 | 0.422222 | 0.466667 | 0.511111 | 0.044444 | 0.000000 | 0.177778 | 38.010289 | -111.687129 |
| townhouse | 2331.984159 | 1541.903331 | 2.682778 | 2.342405 | 0.650690 | 0.647847 | 0.473193 | 0.046304 | 0.025995 | 0.076767 | 37.638376 | -99.251916 |
px.box(DF3, x='monthly_rent', y='type')
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels import *
#H0: There is no difference between the groups and thier monthly rent.
#HA: There is a difference between the groups and thier monthly rent.
new = ols('monthly_rent ~ type', data= DF3).fit()
an = sm.stats.anova_lm(new, typ = 1)
print(an)
df sum_sq mean_sq F PR(>F) type 9.0 4.778726e+08 5.309695e+07 88.871114 6.070463e-165 Residual 47488.0 2.837219e+10 5.974602e+05 NaN NaN
Are ANOVA test returned a p-value less than .05, meaning there is a statistically significant difference between the groups. To find out where the difference is, we will continue the analysis.
pair_t = new.t_test_pairwise('type')
pair_t.result_frame
| coef | std err | t | P>|t| | Conf. Int. Low | Conf. Int. Upp. | pvalue-hs | reject-hs | |
|---|---|---|---|---|---|---|---|---|
| condo-apartment | 245.432403 | 18.282265 | 13.424617 | 5.154720e-41 | 209.598908 | 281.265897 | 2.268077e-39 | True |
| cottage/cabin-apartment | 139.590905 | 69.539064 | 2.007374 | 4.471552e-02 | 3.293370 | 275.888440 | 7.346317e-01 | False |
| duplex-apartment | 66.232472 | 28.970160 | 2.286231 | 2.224514e-02 | 9.450554 | 123.014391 | 5.240192e-01 | False |
| flat-apartment | 174.083884 | 59.430008 | 2.929225 | 3.399695e-03 | 57.600240 | 290.567529 | 1.123627e-01 | False |
| house-apartment | 251.210282 | 9.754872 | 25.752290 | 3.033636e-145 | 232.090597 | 270.329966 | 1.365136e-143 | True |
| in-law-apartment | 140.620859 | 146.134587 | 0.962270 | 3.359191e-01 | -145.804969 | 427.046686 | 9.982979e-01 | False |
| loft-apartment | 106.029097 | 66.170064 | 1.602373 | 1.090799e-01 | -23.665152 | 235.723345 | 9.442854e-01 | False |
| manufactured-apartment | 420.533557 | 115.301114 | 3.647264 | 2.653313e-04 | 194.541766 | 646.525349 | 1.029592e-02 | True |
| townhouse-apartment | 57.962161 | 16.128452 | 3.593783 | 3.262407e-04 | 26.350169 | 89.574153 | 1.232261e-02 | True |
| cottage/cabin-condo | -105.841498 | 71.659013 | -1.477016 | 1.396780e-01 | -246.294162 | 34.611166 | 9.619229e-01 | False |
| duplex-condo | -179.199930 | 33.743190 | -5.310699 | 1.096956e-07 | -245.337053 | -113.062808 | 4.387817e-06 | True |
| flat-condo | -71.348518 | 61.897161 | -1.152695 | 2.490416e-01 | -192.667817 | 49.970780 | 9.942312e-01 | False |
| house-condo | 5.777879 | 19.861777 | 0.290904 | 7.711257e-01 | -33.151480 | 44.707238 | 9.998080e-01 | False |
| in-law-condo | -104.811544 | 147.155191 | -0.712252 | 4.763124e-01 | -393.237769 | 183.614681 | 9.997245e-01 | False |
| loft-condo | -139.403306 | 68.394518 | -2.038223 | 4.153313e-02 | -273.457514 | -5.349097 | 7.198999e-01 | False |
| manufactured-condo | 175.101155 | 116.591936 | 1.501829 | 1.331479e-01 | -53.420666 | 403.622975 | 9.619229e-01 | False |
| townhouse-condo | -187.470242 | 23.652899 | -7.925889 | 2.314236e-15 | -233.830253 | -141.110230 | 9.719790e-14 | True |
| duplex-cottage/cabin | -73.358433 | 75.100220 | -0.976807 | 3.286696e-01 | -220.555911 | 73.839046 | 9.982979e-01 | False |
| flat-cottage/cabin | 34.492979 | 91.283617 | 0.377866 | 7.055318e-01 | -144.424183 | 213.410141 | 9.998080e-01 | False |
| house-cottage/cabin | 111.619377 | 69.970925 | 1.595225 | 1.106686e-01 | -25.524613 | 248.763366 | 9.442854e-01 | False |
| in-law-cottage/cabin | 1.029954 | 161.728446 | 0.006368 | 9.949188e-01 | -315.960056 | 318.019963 | 9.998080e-01 | False |
| loft-cottage/cabin | -33.561808 | 95.808404 | -0.350301 | 7.261141e-01 | -221.347616 | 154.223999 | 9.998080e-01 | False |
| manufactured-cottage/cabin | 280.942652 | 134.518102 | 2.088512 | 3.675700e-02 | 17.285297 | 544.600008 | 6.868072e-01 | False |
| townhouse-cottage/cabin | -81.628744 | 71.140002 | -1.147438 | 2.512064e-01 | -221.064139 | 57.806651 | 9.942312e-01 | False |
| flat-duplex | 107.851412 | 65.850494 | 1.637822 | 1.014654e-01 | -21.216473 | 236.919297 | 9.380694e-01 | False |
| house-duplex | 184.977809 | 29.991985 | 6.167575 | 6.990296e-10 | 126.193101 | 243.762518 | 2.866021e-08 | True |
| in-law-duplex | 74.388387 | 148.861275 | 0.499716 | 6.172773e-01 | -217.381787 | 366.158560 | 9.998016e-01 | False |
| loft-duplex | 39.796624 | 71.991937 | 0.552793 | 5.804079e-01 | -101.308576 | 180.901825 | 9.998016e-01 | False |
| manufactured-duplex | 354.301085 | 118.737983 | 2.983890 | 2.847537e-03 | 121.572984 | 587.029187 | 9.756407e-02 | False |
| townhouse-duplex | -8.270311 | 32.626502 | -0.253484 | 7.998950e-01 | -72.218709 | 55.678087 | 9.998080e-01 | False |
| house-flat | 77.126398 | 59.934755 | 1.286839 | 1.981566e-01 | -40.346557 | 194.599352 | 9.849443e-01 | False |
| in-law-flat | -33.463025 | 157.646233 | -0.212267 | 8.319000e-01 | -342.451840 | 275.525789 | 9.998080e-01 | False |
| loft-flat | -68.054787 | 88.743984 | -0.766866 | 4.431647e-01 | -241.994233 | 105.884659 | 9.997245e-01 | False |
| manufactured-flat | 246.449673 | 129.581496 | 1.901889 | 5.719166e-02 | -7.531866 | 500.431213 | 8.077548e-01 | False |
| townhouse-flat | -116.121723 | 61.295549 | -1.894456 | 5.817057e-02 | -236.261853 | 4.018407 | 8.077548e-01 | False |
| in-law-house | -110.589423 | 146.340584 | -0.755699 | 4.498335e-01 | -397.419007 | 176.240161 | 9.997245e-01 | False |
| loft-house | -145.181185 | 66.623767 | -2.179120 | 2.932765e-02 | -275.764698 | -14.597672 | 6.142330e-01 | False |
| manufactured-house | 169.323276 | 115.562087 | 1.465215 | 1.428690e-01 | -57.180025 | 395.826577 | 9.619229e-01 | False |
| townhouse-house | -193.248121 | 17.899048 | -10.796558 | 3.842698e-27 | -228.330505 | -158.165737 | 1.652360e-25 | True |
| loft-in-law | -34.591762 | 160.308722 | -0.215782 | 8.291585e-01 | -348.799092 | 279.615567 | 9.998080e-01 | False |
| manufactured-in-law | 279.912698 | 186.050412 | 1.504499 | 1.324595e-01 | -84.748703 | 644.574100 | 9.619229e-01 | False |
| townhouse-in-law | -82.658698 | 146.903151 | -0.562675 | 5.736590e-01 | -370.590923 | 205.273527 | 9.998016e-01 | False |
| manufactured-loft | 314.504461 | 132.807815 | 2.368117 | 1.788286e-02 | 54.199291 | 574.809630 | 4.585580e-01 | False |
| townhouse-loft | -48.066936 | 67.850540 | -0.708424 | 4.786856e-01 | -181.054941 | 84.921069 | 9.997245e-01 | False |
| townhouse-manufactured | -362.571396 | 116.273666 | -3.118259 | 1.820316e-03 | -590.469402 | -134.673391 | 6.519101e-02 | False |
Results from ANOVA¶
- If we examine the far right column where the values say "True", these are values that represent a significant difference.
- Condo-apartment
- House-apartment
- Manufactured-apartment
- Townhouse-apartment
- Duplex-condo
- Townhouse-condo
- House-duplex
- Townhouse-house
Part Four: Model¶
DF3.info()
<class 'pandas.core.frame.DataFrame'> Index: 47498 entries, 229 to 49999 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 monthly_rent 47498 non-null int64 1 type 47498 non-null object 2 squarefeet 47498 non-null int64 3 beds 47498 non-null int64 4 baths 47498 non-null float64 5 cats_allowed 47498 non-null int64 6 dogs_allowed 47498 non-null int64 7 smoking_allowed 47498 non-null int64 8 wheelchair_access 47498 non-null int64 9 electric_vehicle_charge 47498 non-null int64 10 comes_furnished 47498 non-null int64 11 lat 47498 non-null float64 12 long 47498 non-null float64 13 state 47498 non-null object dtypes: float64(3), int64(9), object(2) memory usage: 5.4+ MB
Multiple Regression Analysis¶
# Creating Dependent and Independent Variable Dataframes
X = DF3.drop(['monthly_rent','lat', 'long', 'state'], axis= 'columns')
y = DF3['monthly_rent'] # We are looking to predict monthly rent.
# Creating dummy variables for type column
X = pd.get_dummies(X, columns=['type'], drop_first= False)
print(X)
squarefeet beds baths cats_allowed dogs_allowed smoking_allowed \
229 2200 2 2.0 0 1 1
230 1984 2 2.0 0 0 0
231 1984 2 2.0 0 0 0
232 3450 3 3.5 0 0 0
233 5000 5 4.5 1 1 1
... ... ... ... ... ... ...
49995 1250 2 2.0 1 1 0
49996 1243 2 2.0 0 0 0
49997 1250 2 2.0 1 1 0
49998 1250 2 2.0 1 1 0
49999 1190 2 2.0 1 1 1
wheelchair_access electric_vehicle_charge comes_furnished \
229 0 0 0
230 0 0 1
231 0 0 1
232 0 0 1
233 0 0 0
... ... ... ...
49995 0 0 0
49996 0 0 0
49997 0 0 0
49998 0 0 0
49999 0 0 0
type_apartment type_condo type_cottage/cabin type_duplex type_flat \
229 False False False True False
230 False True False False False
231 False True False False False
232 True False False False False
233 False False False True False
... ... ... ... ... ...
49995 True False False False False
49996 False True False False False
49997 True False False False False
49998 True False False False False
49999 True False False False False
type_house type_in-law type_loft type_manufactured type_townhouse
229 False False False False False
230 False False False False False
231 False False False False False
232 False False False False False
233 False False False False False
... ... ... ... ... ...
49995 False False False False False
49996 False False False False False
49997 False False False False False
49998 False False False False False
49999 False False False False False
[47498 rows x 19 columns]
# Converting string data to 0 and 1
X = X.iloc[:, 0:18].astype(int)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .2, random_state = 43)
from sklearn.linear_model import LinearRegression
OLS = LinearRegression()
OLS.fit(X_train, y_train)
LinearRegression()In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
# Display the intercept and coefficients of the OLS model
print('Intercept is: ' + str(OLS.intercept_))
print()
print('The set of coefficients are: ' + str(OLS.coef_))
print()
print('The R-Squared value is: ' + str(OLS.score(X_train,y_train)))
Intercept is: 1603.423654424329 The set of coefficients are: [ 3.84598452e-01 -5.46718152e+01 1.42053228e+02 -3.04299590e+00 -3.58969173e+01 -4.52971521e+01 1.41273363e+02 2.90259897e+02 3.67705568e+02 1.27464318e+02 2.47651832e+02 1.27567977e+02 7.33833465e+01 1.43217913e+02 2.15902467e+01 4.05557505e+02 1.05804562e+02 4.61022101e+02] The R-Squared value is: 0.11231675179290102
# Predicting with OLS
y_pred = OLS.predict(X_test)
performance = pd.DataFrame({'Predictions': y_pred, 'Actual Values': y_test})
performance['error'] = performance['Actual Values'] - performance['Predictions']
performance.head()
| Predictions | Actual Values | error | |
|---|---|---|---|
| 31123 | 2282.162678 | 1960 | -322.162678 |
| 17539 | 2601.203160 | 2300 | -301.203160 |
| 35431 | 2401.435551 | 1893 | -508.435551 |
| 34746 | 2284.710567 | 1899 | -385.710567 |
| 24613 | 2143.921643 | 2100 | -43.921643 |
# Preparing data for plotting
performance.reset_index(drop = True, inplace= True)
performance.reset_index(inplace= True)
performance.head()
| index | Predictions | Actual Values | error | |
|---|---|---|---|---|
| 0 | 0 | 2282.162678 | 1960 | -322.162678 |
| 1 | 1 | 2601.203160 | 2300 | -301.203160 |
| 2 | 2 | 2401.435551 | 1893 | -508.435551 |
| 3 | 3 | 2284.710567 | 1899 | -385.710567 |
| 4 | 4 | 2143.921643 | 2100 | -43.921643 |
# Plot residual
fig = plt.figure(figsize=(15,5))
sns.barplot(data= performance[:50], x= 'index', y= 'error', color= 'blue')
plt.xlabel('Observations')
plt.ylabel('Residuals')
plt.show()
Interpreting Model Visualization¶
- We can see that our model both over predicted and under-predicted. For example - only examining the first 50 observations, observations with a positive residual value means the model under-predicted. and a negative residual value means the model over-predicted. A model witht he error terms closer to 0 is a better model.
X_train = sm.add_constant(X_train)
X_train.head()
Model = sm.OLS(y_train, X_train).fit()
Model.summary()
| Dep. Variable: | monthly_rent | R-squared: | 0.112 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.112 |
| Method: | Least Squares | F-statistic: | 267.0 |
| Date: | Sat, 28 Sep 2024 | Prob (F-statistic): | 0.00 |
| Time: | 14:57:26 | Log-Likelihood: | -3.0473e+05 |
| No. Observations: | 37998 | AIC: | 6.095e+05 |
| Df Residuals: | 37979 | BIC: | 6.097e+05 |
| Df Model: | 18 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | 1603.4237 | 23.796 | 67.381 | 0.000 | 1556.782 | 1650.065 |
| squarefeet | 0.3846 | 0.012 | 32.203 | 0.000 | 0.361 | 0.408 |
| beds | -54.6718 | 6.803 | -8.037 | 0.000 | -68.006 | -41.338 |
| baths | 142.0532 | 8.810 | 16.124 | 0.000 | 124.785 | 159.321 |
| cats_allowed | -3.0430 | 15.923 | -0.191 | 0.848 | -34.253 | 28.167 |
| dogs_allowed | -35.8969 | 15.452 | -2.323 | 0.020 | -66.183 | -5.611 |
| smoking_allowed | -45.2972 | 8.160 | -5.551 | 0.000 | -61.291 | -29.303 |
| wheelchair_access | 141.2734 | 12.264 | 11.519 | 0.000 | 117.235 | 165.312 |
| electric_vehicle_charge | 290.2599 | 18.783 | 15.453 | 0.000 | 253.444 | 327.075 |
| comes_furnished | 367.7056 | 14.711 | 24.995 | 0.000 | 338.871 | 396.540 |
| type_apartment | 127.4643 | 17.722 | 7.192 | 0.000 | 92.729 | 162.200 |
| type_condo | 247.6518 | 25.513 | 9.707 | 0.000 | 197.646 | 297.657 |
| type_cottage/cabin | 127.5680 | 77.156 | 1.653 | 0.098 | -23.659 | 278.795 |
| type_duplex | 73.3833 | 34.957 | 2.099 | 0.036 | 4.867 | 141.900 |
| type_flat | 143.2179 | 64.683 | 2.214 | 0.027 | 16.437 | 269.999 |
| type_house | 21.5902 | 19.722 | 1.095 | 0.274 | -17.066 | 60.246 |
| type_in-law | 405.5575 | 154.655 | 2.622 | 0.009 | 102.429 | 708.686 |
| type_loft | 105.8046 | 71.316 | 1.484 | 0.138 | -33.976 | 245.585 |
| type_manufactured | 461.0221 | 123.785 | 3.724 | 0.000 | 218.400 | 703.644 |
| Omnibus: | 29757.720 | Durbin-Watson: | 2.002 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 1009487.779 |
| Skew: | 3.509 | Prob(JB): | 0.00 |
| Kurtosis: | 27.256 | Cond. No. | 5.67e+04 |
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.67e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Part Five: Interpret¶
Someone interested in finding a place to rent via Craigslist could use this report to make strategic decisions on the most affordable locations to rent including based on type of living situation. For example - we examine that South Dakota is the most affordable at an average monthly rent of $1,921.45 and Hawaii being the most expensive at $2,573.64.One can also view the interactive map in which was created using the latitude and longitude data to give a precise location of the properties.
Furthermore, one can examine the average rent based on housing type while viewing if the means are statistically different by examining the ANOVA section of the report.
This report can also be used for someone interested in either real estate investing or just renting out their property. Thos individuals can examine not only the exploration section, but also the model section to see what features have an impact on monthly rent. However, this isnt the best model for two reasons:
- The R-square is low at .11 meaning that only 11% of the variation in monthly rent can be explained by the independent variables in the model. This means there are other variables out there thats not in the model that may explain more of the variation of monthly rent.
- The residual plot shows shows variation in the model, in otherwords the model under predicted and over predicted the monthly rent price.